1 Morning

1.1 Gene ID conversion using DAVID

1.1.1 Prepare input data

Download the Ensembl Gene list data we’ll be using:

https://wd.cri.uic.edu/pathway/GeneID_Conversion_list.txt

1.1.3 Upload to DAVID

  • Step 1: Click “Choose File” and browse to the gene list text file that downloaded
  • Step 2: Choose “ENSEMBL_GENE_ID” from the Select Identifier drop-down
  • Step 3: Select the “Gene List” radio button
  • Step 4: Submit List to DAVID.

1.1.4 Submit to conversion tool

  • Step 5: Choose “GeneID Conversion Tool”

  • Step 5: Choose “UNIPROT_ID” from the drop down menu
  • Step 6: Enter “Mus musculus” for the species
  • Step 7: Click on “Submit to Conversion Tool” to submit the gene list for conversion*

1.1.5 Results from David

  • Review “Gene Accession Conversion Statistics”
    • Our GeneID_Conversion_list.txt had 46 genes in the list and 42 are converted
  • Right-click on the “Download File” link on right hand corner above the results to save as a file. Save it as david_id_conversion_results.tsv.
    • Open in Excel and review

  • Observations:
    • Open david_id_conversion.tsv in Excel
    • 329 UniProt IDs records are returned for 42 ENSEMBL ID
    • Many genes have one to many relationship

1.2 Gene ID conversion using UniProt

We’re going to use the same Ensemble gene list as in the DAVID exercise:

https://wd.cri.uic.edu/pathway/GeneID_Conversion_list.txt

1.2.2 Upload to UniProt

  • Step 1 Click “1. load from a text file”
    • Click “Choose File” and browse to the gene list text file that was downloaded
  • Step 2 Under “2. Select options”
    • Choose “ENSEMBL” from the “Genome Annotation Databases” on the Select Identifier drop-down “From”
    • Choose “UniProtKB” from the “UniProt” on the Select Identifier drop-down “To”
    • In “Name Your ID Mapping Job” put ensemble_to_uniprot
    • Click “Map 46 IDS”
    • To view mapping results click on Completed by your job name ensemble_to_uniprot under Tool results

1.2.3 Results from UniProt

UniProt Knowledgebase has two databases

  • Swiss-Prot: Proteins that are manually annotated and reviewed
  • TrEMBL: Proteins that are automatically annotated and not reviewed

Observations:

  • 41 out 46 Ensemble identifiers are successfully mapped to 102 UniProtKB IDs
    • For many Ensembl gene IDs we have multiple UniProt IDs
    • Out of 102 records, 39 records have Swiss-Prot Ids and 63 have TrEMBL IDs
  • Click pencil icon by ‘Customize columns’ - to review other available annotations.

  • Review other available annotations
    • Select “Gene ontology IDs” under “Gene Ontology (GO)”
    • Hit “Save” button once done

  • Review results with other added annotations
  • Download the results. Various formats are available e.g fasta, gff, tab, excel

1.3 Gene ID conversion using BioMart

We’re going to use the same Ensembl gene list as in the DAVID exercise:

https://wd.cri.uic.edu/pathway/GeneID_Conversion_list.txt

1.3.2 Choose Database(s)

  • Step 1 Choose “Ensembl Genes 112” from “CHOOSE DATABASE”
  • Step 2 Choose “Mouse genes (GRCm39)” from “CHOOSE DATASET”

Note: This will automatically load left panel with Filters and Attributes.

1.3.3 Upload to BioMart

  • Step 1 Click “Filters”. This will open “Filter” panel in the middle.
  • Step 2 Click “+” sign beside “GENE” on left hand side. We can filter query results based on any one of the seven options, e.g. Region, Gene.

  • Step 3
    • Clicking on “Gene” will expand and will show option to upload the file.
    • Click “Choose File” and browse to the gene list text file that we downloaded GeneID_Conversion_list.txt. Note we are using Ensembl Gene IDs as our input so we do not need to change anything in select identifier dropdown.

Select Attributes: Click “Attributes” in the left panel. This will open “Attributes” panel in the middle.

  • Step 1: “Features” is selected by default

  • Step 2: Click “+” beside “External”
    • Select “Go term accession” under “GO”
    • Select “UniProtKB Gene Name ID” under External References (max 3)
    • Select “UniProtKB/Swiss-Prot ID” under External References (max 3)
    • Select “UniProtKB/TrEMBL ID” under External References (max 3)

1.3.4 Results from BioMart

  • Results
    • Make sure all attributes of interest are selected.
    • Click “Results” in top left panel menu.

1.3.5 Saving the results

  • Top 10 results will be displayed if your query generates the results.
  • To Save the results to file: “Export all results to”
    • Choose “compressed file (.gz)” from drop down
    • Choose “TSV” from the dropdown
    • Choose “Unique results only”
    • Click go
    • Look for mart_export.txt.gz file in the Downloads folder.

Note for larger query one can choose to get Email notification.

1.4 Using QuickGO

1.4.2 Search using GO ID

  • View GO ID Results
    • Explore results by clicking on left menu for tabs, e.g. “Synonyms”, “Ancestor Chart”, or “Child terms”

  • Filter data
    • Click on “165,834 annotations” as shown in the previous image
    • Explore different tabs like “Taxon”, “Go terms”
    • Click “Taxon” to filter the results
      • Click “Homo sapiens”
      • Click “Apply”

  • Click “Export” to download the data

  • Click “Statistics” to get Summary

1.4.3 Search using keyword

1.5 Using MSigDB

1.5.2 Explore different collections

For this exercise we will highlight several different collections.

  • Hallmark and KEGG pathway collections given as examples below
  • Also note:
    • C2: Reactome pathways
    • C3: miRNA and TF target genes
    • C5: Gene Ontology databases

  • Hallmark Collection: Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. Currently it has total of 50 gene sets.
    • Click “Browse Gene Sets” on the left panel
    • Click “H” to expand the Hallmark collection
      • Scroll to the bottom of the page. It will list of all Gene sets for Hallmark Collection.
      • Click on any one of the gene set collection to view the details and to download the gene set.

  • Review the details for the given gene set collection.

  • C2 collection: Curated gene sets from online pathway databases, publications in PubMed, and knowledge of domain experts. It has various sub-collections for e.g. CP:KEGG_Legacy represents canonical pathway’s in KEGG’s database.
    • Click “Browse Gene Sets” on the left panel
    • Click “C2” to expand the C2 collection
      • Click “CP:KEGG_Legacy” to expand the collection
      • Scroll to the bottom of the page. It will list of all Gene sets for CP:KEGG_Legacy collection.
      • Click on any one of the gene set collection to view the details

  • Review the details for the given gene set collection.

1.5.3 Search gene sets in the collection browsing

  • Click “Browse” in left menu bar. Search term “lung” in Gene set name and see what collections come back. Scroll all the way at bottom to review the collections results.
  • Click “BLANCO_MELO_COVID19_SARS_COV_2_POS_PATIENT_LUNG_TISSUE_DN” from the returned results for review

  • Review results for “BLANCO_MELO_COVID19_SARS_COV_2_POS_PATIENT_LUNG_TISSUE_DN”
  • Download gene set of interest

1.5.4 Search gene sets in all of MSigDB by keyword

  • Click “Search” in the left menu under Mouse Collections
    • Requires to register on the website to search.
  • Type “Apoptosis” in Keywords bar
  • Leave all other values as default

Review results

1.6 Browse KEGG

  1. Open a web browser to https://www.genome.jp/kegg
  2. Click on KEGG PATHWAY in the Data-oriented entry points.
  3. In search (keywords) box enter “VEGF” and click Go.
  4. On Search Results page, click map04370 to view the pathway entry for the VEGF signaling pathway.

1.6.1 Pathway entry view

The KEGG Pathway entry page has the following details for the pathway map04370.

  • Entry - The Entry ID and type (Pathway).
  • Name - Name of the pathway,
  • Description - Short description/definition of the Entry.
  • Class - BRITE categories for the pathway
  • Pathway map - Map of the pathway
  • Other DBs - Links to other databases for this entry.
    • GO - Link to associated gene ontology term.
  • References - Important references.

1.6.2 Pathway View

  1. Click on the picture of the pathway in the Pathway map section.

The KEGG Pathway view, gives a graphical representation of the pathway. In the pathway map, you can observe the following.

  • Proteins/Enzymes are depicted as rectangles with their EC number.
  • Compounds are depicted as small circles with their name adjacent.
  • Reactions are denoted as arrows from substrate to product with associated enzyme components above/below the line.
  • Upstream or downstream pathways are depicted as rectangles with rounded corners.

In this view, you can click on any of the protein/enzymes, compounds or associated pathways to view the details for that entry.

Above the pathway map are the following links

  • Pathway menu - Opens the pathway menu (list of all pathways) for this pathway.
  • Organism menu - Opens the Organism menu (list of all organisms) for this pathway.
  • Pathway entry - Shows the entry (details) page for this pathway.
  • Show description - Show the description for the pathway.
  • Download - Download the pathway information in png format
  • Help - Help information

Side Panel

  • Color - Opens dialog box that allows a user to provide custom colors for entries on the pathway.

Note the Change pathway type button. This opens the organism menu (list of organisms/genomes) listing those organisms that have at least some portion of this pathway.

1.6.3 Orthology Details Page

  1. Click on the VEGF protein rectangle in the left side of the pathway map. This will open the Orthology Details page.

  • Entry - The Entry ID and type (KO for KEGG Orthology).
  • Name - Name of the entry.
  • Pathway - List of KEGG Pathway(s) of which this entry is a member.
  • Disease - List of KEGG Disease of which this entry is a member.
  • BRITE - Functional hierarchies for this entry.
  • Other DBs - Links to other databases for this entry.
    • GO - Link to associated gene ontology term.
  • Genes - Ortholog genes.
  • References - Important references.

1.6.4 Organism specific pathway view

  1. Go back to browser window with the Pathway view page for map04370
  2. Click on the Change pathway type button on top left.

This list displays the organism specific versions of the pathway that are available. The number to the right of the organism name denotes the fraction of genes present in that organism.

  1. Click on hsa to open the Human specific version of the pathway. In this view of the pathway the molecules that are present in the selected organism are highlighted in green.

1.6.5 Gene Details Page

  1. Click on the VEGF protein rectangle in the left side of the organism specific pathway map. The KEGG Gene Details page has the following details for the Gene hsa:7422.

  • Entry - The Entry ID, type (CDS for coding sequence), and ID of source genome.
  • Symbol - Name of the entry
  • Name KO - Short description/definition of the Entry. Include link to associated KEGG Orthology (KO) entry.
  • Organism - Source genome/organism.
  • Pathway - List of Organism specific KEGG Pathway(s) of which this entry is a member.
  • Network - List of perturbed molecular networks
  • Disease - List of Disease
  • Drug Target - List of Drug Target
  • BRITE - Functional hierarchies for this entry.
  • SSDB - Links to sequence similarity database.
    • Ortholog - Orthologous genes found in other organisms.
    • Paralog - Sequence similar genes in the same genome.
    • Gene cluster - Details of the local gene cluster.
    • GFIT - Sequence similarity tables.
  • Motif - List of Pfam protein motifs.
  • Other DBs - Links to other databases for this entry.
    • NCBI-ProteinID - Link to this entry in NCBI protein database.
    • Uniprot - Link to this entry in Uniprot database.
  • Structure - Link to PDB database.
  • LinkDB - Link to all databases.
  • Position - Location of the gene in the genome.
  • AA seq - Translated sequence of the gene.
  • NT seq - Nucleotide sequence of the gene.

1.7 Pathway Enrichment in DAVID

1.7.1 Prepare input data

Download the RNA-seq data we’ll be using:

https://wd.cri.uic.edu/pathway/RNAseq_example.xlsx

Next, we want to generate a list of differentially expressed genes:

  • Open the file in Excel
  • Go to the example_diff tab
  • In the drop-down, over column F ( Infection/Control : QValue ), filter for values less than 0.05

Save these genes to a text file:

  • Select all of the genes in column A (excluding the title line)
    • We’re using Ensembl IDs for the most precise gene identification
  • Open a new Excel file and paste them
  • Save As tab-delimited text to RNAseq_example.txt
    • You can download this file also from:

https://wd.cri.uic.edu/pathway/RNAseq_example.txt

1.7.2 Upload to DAVID

Navigate to DAVID https://david.ncifcrf.gov, and follow the same steps as before to upload the new data set. First choose “Start Analysis” from the top menu.

On the next page:

  • Click “Choose File” in the “Upload” tab and browse to the gene list text file that downloaded
  • Choose “ENSEMBL_GENE_ID” from the Select Identifier drop-down
  • Select the “Gene List” radio button
  • Submit List to DAVID.

After uploading, click the “Functional Annotation Tool” link.

Note that DAVID has auto-detected the species from the ENSEMBL IDs. If we had uploaded gene symbols, we would need to select the correct species here.

We can choose which pathways to run enrichment against. DAVID has these organized into several categories. For today, we will enrich against Gene Ontology Biological Processes, and KEGG pathways (i.e., two databases of biological pathways).

  • Uncheck the “Check Defaults” button
  • Open the Gene_Ontology section and select the GOTERM_BP_FAT option
    • The different numbered options (GOTERM_BP_1, GOTERM_BP_2, etc.) allow you to select different levels of the GO hierarchy
    • ALL is all terms
    • DIRECT omits parent terms
    • FAT is almost all terms, but the very broad ones are filtered out. We recommend this option, as it gives more specificity with a larger set of terms, but removes the very broad ones that would be uninformative.
    • Click on the CHART buttons next to each to get a quick view of how many pathways are included in the analysis, and what their enrichment statistics are.
  • Open the Pathways section and select KEGG_PATHWAYS

Other databases here may be useful for other contexts, so it can be valuable to explore them. For example, for the other GO categories:

  • GO MF has types of molecular functions, such as cytokine binding or kinase activity, which can be useful if you’re trying to characterize classes of molecules.
  • GO CC has localization of molecules in the cell. This can be useful if, for example, you were looking for cell surface markers in a set of differentially expressed genes from a single-cell experiment.

There are also pathways from Reactome available.

1.7.3 Results from DAVID

Select “Functional Annotation Chart” for the pathway enrichment statistics per term. Right-click on the Download File link to save as a file.

  • NOTE: by default DAVID only downloads terms with an intersection of at least 2 molecules and a nominal p-value < 0.1. If you’re running a pathway comparison, as we will in the afternoon, you may want to relax these thresholds for the purposes the comparison. You can do this in the Rerun Using Options drop-down on this page.

Open in Excel to view

  • Please note following picture is from last year.Results would little bit different when you re-run it.
  • Use the FDR column for statistical significance.
  • This table also gives you the different counts that were used in Fisher’s Exact test.
    • Note that the Pop Total column is the total number of genes annotated in the database. Similarly, the List Total is the subset of your DEGs that are annotated to the database. Both of these numbers will change a bit with different databases (e.g., GO BP vs KEGG).
  • Looking through the terms, we can see a large number related to immune response, as would be expected from a viral infection.

A second result from DAVID is a functional annotation clustering, where it attempts to group terms that are similar into “clusters” of terms.

Select “Functional Annotation Clustering” to open this result. There are options to adjust the way that the clustering is performed. Again right-click on the Download File link to save as a file david_cluster.txt.

This can provide a useful way to parse a large list of similar terms. For example:

  • Annotation Cluster 1 relates to response to other organisms and defense
  • Annotation Cluster 2 relates to signal transduction and cell communication
  • Annotation Cluster 3 relates to regulation of immune response

1.8 INSTRUCTOR DEMONSTRATION: Pathway Enrichment in Ingenuity Pathway Analysis (IPA)

We’re going to use the same RNA-seq data set as in the DAVID exercise:

https://wd.cri.uic.edu/pathway/RNAseq_example.xlsx

1.8.1 Upload data to IPA

We will save the entire “example_diff” tab as a tab-delimited text file to upload to IPA

  • We can filter for q-value and/or fold-change within IPA, which is more convenient if we want to try different thresholds later.
  • Uploading the fold-changes also allows IPA to compute its z-scores and make inferences about up- or down-regulation of different pathways.
  • Before uploading to IPA, you may want to make a new project first. In our case, we already have a test project created for the workshop.

To upload to IPA, go to File > Upload Dataset, and browse to the text file made above. IPA will automatically parse the file in the next dialog:

  • Use INFER OBSERVATIONS: automatically identify the columns. Usually IPA does this correctly, but it’s good to double-check:
    • ID column identified as Ensembl
    • Ignore second column: don’t need the gene symbol if we have the Ensembl ID
    • Expr Log ratio is correct, observation name is”Infection/Control : LogFC”
    • Ignore fourth column: don’t need the logCPM
    • Set fifth column to ignored: don’t need p-value, since we’ll use FDR
    • FDR is correct, observation name is also “Infection/Control : LogFC”
  • Check Dataset Summary tab: how many IDs are recognized by IPA. “SHOW DETAILS” button gives more information about unmapped IDs.
  • Click SAVE
    • IPA starts uploading the file
    • Ignore warning about missing metadata
    • Choose which project to save it to

After loading, IPA opens a annotated dataset view, which lets you know how IPA is interpreting all of the genes. We can close this window.

1.8.2 Start core analysis

Pathway enrichment analysis is part of the “Core Analysis”. Other modules include enrichment against upstream regulators, disease biomarkers, toxicity functions, and others. IPA runs everything at once.

  • Browse to the project/dataset (“Workshop Project/Dataset Files”
    • Other folders are generated automatically, used for different analyses
  • Right-click on the dataset file and choose New > Core Analysis
    • Leave the defaults in the next screen: expression analysis, and basing z-scores on expression log ratio.
  • Next customize the analysis: set thresholds for FDR and/or fold-change.
    • Set Expr False Discovery Rate (q-value) threshold to 0.05.
    • We won’t set a threshold for the expression log ratio, but if, for example, you wanted to limit to a 2-fold change, you would choose -1 for Down and 1 for Up.
    • RECALCULATE to update gene list.
    • You can also customize how the databases are used in the other sections, but we’ll leave them all as defaults for now.
  • Click RUN ANALYSIS
    • Confirm which project to save to. Results automatically saved in the Analyses folder.
    • Job is submitted to IPA’s cloud server. Runs in background, and sends email notification when complete. Jobs typically take a couple minutes to run. You can close IPA while it runs.

For today, we’ve run this data set previously, so we’ll skip to those results.

1.8.3 View IPA results

Browse to the Analyses folder and double-click a result to open the core analysis.

  • Different tabs for different databases. In particular:
    • Pathways -> Canonical Pathways are biological/functional pathways, conceptually similar to GO BP or KEGG pathways.
    • Upstream Analysis compare target molecules of regulators to the input. Useful for inferring which regulators may be responsible for the changes in the data.
    • Diseases and Functions are biomarkers for different disease or broad functional categories.

Start with the Canonical Pathways tab.

  • The bar plot shows significance in -log10 scale (i.e., 2 = p-value of 0.01). Bars are colored by z-score; gray bays indicate that the z-scoring model could not be computed for that pathway.
  • IMPORTANT: IPA reports nominal p-value by default. This is wrong: it guarantees many false positives. IPA does compute the B-H corrected p-value (FDR), just need to turn it on:
    • Click the CUSTOMIZE CHART button.
    • At the bottom, change Select Scoring Method to B-H Multiple Testing Correction p-value, then APPLY.
    • Need to make similar adjustments in other tabs.
  • Positive z-scores indicate up-regulated pathways. Since our comparison was infection/control, this means up-regulated in infection.
  • You can also export these results in tabular text or Excel formats, or the barplot in a PDF.

1.8.4 Look at a specific IPA pathway map

If you want more detail into the structure of a particular pathway, double-click it (twice). This will open an interactive image.

  • There are a number of ways to manipulate these images, which we will detail more in the afternoon.

1.9 Pathway Enrichment in R

1.9.1 Data input (overview)

We will be using the same RNA-seq data set as before:

https://wd.cri.uic.edu/pathway/RNAseq_example.xlsx

  • Use RNAseq_example.txt file from the DAVID exercise
  • Compare to a set of anti-viral genes, which were determined from a literature review
  • Also need the total number of genes in the genome, from the “example_norm” tab (save as tab-delimited text to RNAseq_norm.txt)

For consistency, we will obtain these data sets from RIC github repository. However, you can run the same exercise by saving tab-delimited text files from Excel and loading into R.

1.9.2 Enrichment test

Open R Studio to perform this analysis, navigating to the folder where you saved your RNAseq_example.txt and RNAseq_norm.txt files.

# read in gene list from our server
degs <- read.delim("https://wd.cri.uic.edu/pathway/RNAseq_example.txt")
# remake as a vector
degs <- degs[,1]
# read in the antiviral gene list from RIC github repository
antiviral <- read.delim("https://wd.cri.uic.edu/pathway/antiviral_list.txt")
# remake as a vector
antiviral <- antiviral[,1]
head(antiviral)
## [1] "ENSMUSG00000042349" "ENSMUSG00000026896" "ENSMUSG00000027639"
## [4] "ENSMUSG00000028037" "ENSMUSG00000040296" "ENSMUSG00000039853"
# read in norm table from our server
norm <- read.delim("https://wd.cri.uic.edu/pathway/RNAseq_norm.txt",
   row.names=1)
all_genes <- rownames(norm)
# check how big each list is
length(degs)
## [1] 2632
length(antiviral)
## [1] 43
length(all_genes)
## [1] 25931
# obtain true/false vectors for all genes based on intersection
# with degs or antiviral
antiviral_list <- all_genes %in% antiviral
degs_list <- all_genes %in% degs
# we can use table to confirm the number of genes in each
table(antiviral_list)
## antiviral_list
## FALSE  TRUE 
## 25888    43
table(degs_list)
## degs_list
## FALSE  TRUE 
## 23299  2632
# and we can use table to get the 2x2 contingency table
fet.table <- table(data.frame(antiviral_list, degs_list))
fet.table
##               degs_list
## antiviral_list FALSE  TRUE
##          FALSE 23294  2594
##          TRUE      5    38
# run fisher's exact test
fisher.test(fet.table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  fet.table
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   26.77896 223.26397
## sample estimates:
## odds ratio 
##   68.31723

Antiviral genes are very strongly enriched: odds ratio is ~70x, p-value is very small.

1.9.3 More efficient processing (bonus exercise)

Note: you may want to write a function to allow you do this calculation more rapidly. We will skip this for today, but here’s an example:

pathway_enrich <- function( degs, pathway, all_genes ){
  pathway_list <- all_genes %in% pathway
  degs_list <- all_genes %in% degs
  fet.table <- table(data.frame(pathway_list, degs_list))
  fet <- fisher.test(fet.table)
  # return the odds ratio and p-value as a vector
  result <- c(fet$p.value, fet$estimate)
  names(result)[1] = "p.value"
  return(result)
}
pathway_enrich(degs, antiviral, all_genes)
##      p.value   odds ratio 
## 7.968619e-33 6.831723e+01

1.10 Pathway Enrichment in GSEA

1.10.1 Prepare input .txt and .cls files

We will be using the same RNA-seq data set as before:

https://wd.cri.uic.edu/pathway/RNAseq_example.xlsx

Input .txt file prepared from the normalized expression table (“example_norm” tab). Again, for consistency we will obtain this file from RIC github repository. However, you can run the same exercise by saving tab-delimited text files from Excel and loading into R.

Open R Studio to prepare the input files. Use ENSEMBL IDs to go with GSEA’s built-in CHIP file for converting to gene symbols. Preparation steps:

  • Remove genes with average CPM < 1
  • Log-scale the expression
  1. Read from our server; the colClasses parameter tells R to ignore the columns named Gene.name
norm <- read.delim("https://wd.cri.uic.edu/pathway/RNAseq_norm.txt", row.names=1, colClasses=c(Gene.name="NULL"))
  1. Subset to CPM > 1
norm <- norm[rowMeans(norm) > 1,]
  1. log-scale with a pseudocount
norm <- log2(norm + 0.1)
  1. Duplicate two extra columns of the gene IDs and change the names of the new columns
norm <- cbind(rownames(norm), rownames(norm), norm)
colnames(norm)[1:2] = c("NAME","DESCRIPTION")
  1. Write the file
write.table(norm, "RNAseq_for_gsea.txt", col.names=T, row.names=F, quote=F, sep="\t")
  1. Make the .cls file. Use the append=T parameter in write.table to keep adding to the same file, and transpose vectors so they’re written along the rows.
groups <- c(rep("Control",4),rep("Infection",4))
write.table(t(c(8, 2, 1)), "RNAseq_for_gsea.cls", col.names=F, row.names=F)
write.table("# Control Infection", "RNAseq_for_gsea.cls", col.names=F, row.names=F,
  append=T, quote=F)
write.table(t(groups), "RNAseq_for_gsea.cls", col.names=F, row.names=F, append=T, quote=F)

NOTE: you can also download these completed files if you have trouble with the above steps:

https://wd.cri.uic.edu/pathway/RNAseq_for_gsea.txt

https://wd.cri.uic.edu/pathway/RNAseq_for_gsea.cls

1.10.2 Go to GSEA

Go to the GenePattern cloud website: https://cloud.genepattern.org/gp/pages/login.jsf

After logging in, enter “GSEA” under the Modules search box, and pick the first option (GSEA).

  • NOTE: if you’re using your own ranking, you should select the GSEAPreranked module.

1.10.3 Set up and run GSEA

  • expression dataset: Select our .txt file. The help info says to use a .gct or .res file, but .txt works also.
  • gene sets database: Let’s use the GO database. Select m5.go.bp.v2024.1.Mm.symbols.gmt [Gene Ontology].
    • Here’s where you could upload a .gmt file if you wanted to test a custom gene set.
  • number of permutations: Leave at 1000 for now, though you may want to consider 10,000 for more accurate p-values.
  • phenotype labels: Select our .cls file.
  • permutation type: Set to gene_set, since we have <7 replicates per group.
  • chip platform file: Browse to Mouse_ENSEMBL_Gene_ID_MSigDB.v2024.1.Mm.chip, to match the ENSEMBL IDs to gene symbols.
  • output file: Set to gsea_out.zip.
  • Then Run the job at the bottom.

You can close your browser while the job runs. When you log back in, you can check the status or see results.

1.10.4 Download results

After results are finished, browse to the Jobs tab and click on the run name. Choose Download Job on the right.

  • Note that you can also view the code to run GSEA in Java, MATLAB, R, or Python if you want to learn how.
  • For the impatient: You can download the results from our server as well…

https://wd.cri.uic.edu/pathway/gsea_out.zip

1.10.5 View results

Unpack the zip folder, and find the main result file, index.html, double-click to open in a browser.

  • There are two sets of results, one for gene sets up-regulated in Control, and another for gene sets up-regulated in Infection.
  • Click on the Detailed enrichment results in html or Excel to see details (note: Excel is really just tab-delimited text). Html is more interactive.

Within the detailed view:

  • See the FDR q-value for statistical significance
  • The pathway name link will take you to MSigDB
  • The Details link will take you to the leading edge figure and statistics, gene list, plus a heatmap

2 Afternoon

2.1 Barplot visualization

We’re using chart results from an analysis on DAVID.

Open R Studio to perform this analysis:

# read in the results from DAVID
# the quote="" helps us to parse pathway names that have ' or " in them
david <- read.delim("https://wd.cri.uic.edu/pathway/david_chart.txt",
  quote="")
# check how the names are interpreted in R
colnames(david)
##  [1] "Category"        "Term"            "Count"           "X."             
##  [5] "PValue"          "Genes"           "List.Total"      "Pop.Hits"       
##  [9] "Pop.Total"       "Fold.Enrichment" "Bonferroni"      "Benjamini"      
## [13] "FDR"
# sort by significance
david <- david[order(david$FDR),]
# for now, we'll focus on the KEGG pathways with FDR < 1%
david.kegg <- david[david$Category=="KEGG_PATHWAY" & david$FDR < 0.01,]
# log-scale the FDR and the enrichment ratio
david.kegg$logFDR <- -log10(david.kegg$FDR)
# the KEGG pathway names have IDs in them, fix it so that we just plot the name
head(david.kegg$Term)
## [1] "mmu04060:Cytokine-cytokine receptor interaction"
## [2] "mmu05140:Leishmaniasis"                         
## [3] "mmu04668:TNF signaling pathway"                 
## [4] "mmu05168:Herpes simplex infection"              
## [5] "mmu04380:Osteoclast differentiation"            
## [6] "mmu05162:Measles"
david.kegg$Term <- gsub("mmu[0-9]*:","",david.kegg$Term)
head(david.kegg$Term)
## [1] "Cytokine-cytokine receptor interaction"
## [2] "Leishmaniasis"                         
## [3] "TNF signaling pathway"                 
## [4] "Herpes simplex infection"              
## [5] "Osteoclast differentiation"            
## [6] "Measles"
# load the ggplot2 library
library(ggplot2)

NOTES FOR PLOT:

  • Use coord_flip() to plot the bars horizontally
  • Set the Term column as a factor with levels specified by the reverse order of the values so that the most significant is on top. This prevents ggplot2 from reordering the columns

2.1.1 Plot \(-\log_{10}\) FDR

david.kegg$Term<-factor(david.kegg$Term, levels = rev(david.kegg$Term))
ggplot(data=david.kegg, aes(x=Term, y=logFDR)) + geom_col() + coord_flip() + 
  labs(y="-log10 FDR")

Example of same plot using the scales and a reverse log scale function

library(scales)

# We only really need to do this once, but this will allow us 
# to quickly create a reverse log scale in ggplot2
revlog_trans <- trans_new("revlog",
                          function (x) { -log10(x) },
                          function (x) { 10 ^ (-x) }, log_breaks(base=10),
                          domain=c(1e-100, Inf))

ggplot(data=david.kegg, aes(x=Term, y=FDR)) + geom_col() + coord_flip() +
  scale_y_continuous("FDR corrected p-value", trans=revlog_trans)

2.1.2 Plot the fold enrichment

ggplot(data=david.kegg, aes(x=Term, y=Fold.Enrichment)) + geom_col() + coord_flip() + 
  labs(y="Fold Enrichment")

2.2 Pathway comparison plots

Compare pathway statistics for KEGG pathways in 3 different pathway enrichment results. These were generated on DAVID from three different gene lists generated from an RNA-seq clustering analysis.

For these files, we generated the Functional Annotation Chart in DAVID using a minimum of 1 molecule and a nominal p-value < 1, to include all terms in the comparison. We’ll filter for overall significance below.

2.2.1 Prepare combined data set

Download the three result files, and combine together into a single data frame. This takes a bit of futzing:

  • Fix column names
  • Merge data sets together and fix missing values
  • Subset to just significant terms
  • Log-scale FDR corrected p-values
  • Remove KEGG IDs from pathway names
  1. Download each file
kegg1 <- read.delim("https://wd.cri.uic.edu/pathway/cluster1_KEGG.txt",
  quote="")
kegg2 <- read.delim("https://wd.cri.uic.edu/pathway/cluster2_KEGG.txt",
  quote="")
kegg3 <- read.delim("https://wd.cri.uic.edu/pathway/cluster3_KEGG.txt",
  quote="")
  1. We’ll make the comparison on the basis of the FDR, so get just this value
kegg1_subset <- kegg1[,c("Term", "FDR")]
kegg2_subset <- kegg2[,c("Term", "FDR")]
kegg3_subset <- kegg3[,c("Term", "FDR")]
  1. Use the cluster name for the column names
colnames(kegg1_subset)[2] <- "Cluster1"
colnames(kegg2_subset)[2] <- "Cluster2"
colnames(kegg3_subset)[2] <- "Cluster3"
  1. Now we want to combine these results.
    • We could use merge() to combine 2 data frames, but we have 3 to combine.
    • The Reduce function repeatedly applies a given function to a list of inputs. So, this will apply merge across all three results
kegg_merged <- Reduce(function(x,y) merge(x=x, y=y, by="Term", all=T),
  list(kegg1_subset, kegg2_subset, kegg3_subset))
  1. Missing values are not significant, so set them to FDR = 1
head(kegg_merged)
kegg_merged[is.na(kegg_merged)] <- 1
  1. Remake this as a data frame
kegg_df <- data.frame(kegg_merged)
  1. Subset to the set of terms with FDR < 0.05 for at least one cluster
kegg_subset <- kegg_df[ apply(kegg_df[,-1], 1, min) < 0.05, ]
  1. Remove the term IDs in the names
kegg_subset$Term <- gsub("mmu[0-9]*:", "", kegg_subset$Term)
  1. This file is now nicely organized and scaled
head(kegg_subset)

2.2.2 Plot as side-by-side barplots

We need to do a bit more reformatting to work with ggplot2, as it requires data in the long format to plot in side-by-side barplots.

  1. Sort on most significant (lowest value). Could also use average significance: just replace “max” with “mean” in the apply.
kegg_subset2 <- kegg_subset[order(apply(kegg_subset[,-1], 1, min)), ]
  1. Convert to long format using the tidyr package
    1. First need to add back the row names as a column.
    2. Set the column name for the row names.
    3. Then use pivot_longer to convert the data frame into long format.
library(tidyr)
head(kegg_subset2)
kegg_long <- pivot_longer(kegg_subset2, !Term, names_to = "cluster")
head(kegg_long)
  1. Create barplot using ggplot2
    • fill=variable will color based on the different clusters in the variable column
    • position='dodge' will put the bars next to each other
    • scale_x_discrete, To order the items using the kegg_subset2 list.
    • scale_y_continous, This will use our custom made reverse log transformation for the scale.
library(ggplot2)
library(scales)

ggplot(data=kegg_long, aes(x=Term, y=value, fill=cluster)) + 
  geom_col(position='dodge') + coord_flip() +
  scale_x_discrete(limits=rev(kegg_subset2$Term)) + 
  scale_y_continuous("FDR corrected p-value", trans=revlog_trans)

  1. (OPTIONAL) Use geom_hline to create a red line at an FDR value of 0.05.
ggplot(data=kegg_long, aes(x=Term, y=value, fill=cluster)) + 
  geom_col(position='dodge') + coord_flip() +
  scale_x_discrete(limits=rev(kegg_subset2$Term)) + 
  scale_y_continuous("FDR corrected p-value", trans=revlog_trans) +
  geom_hline(yintercept = 0.05, color="red")

2.2.3 Plot as a heatmap

Also plot the prepared table as a heatmap.

  1. Make a color scale going from white (non significant) to red (most significant) with the circlize package set the middle of the color range to be 0.05 significance
library(circlize)
rownames(kegg_subset)<-kegg_subset$Term
kegg_subset$Term<-NULL
kegg_subset_revlog <- -log10(kegg_subset)
col_fun <- colorRamp2(c(0, -log10(0.05), max(unlist(kegg_subset_revlog))),
  c("white","yellow","red"))
  1. Plot the heatmap, with the legend (colorbar) on the left to avoid running into the pathway names also make the row names smaller
library(ComplexHeatmap)
heatmap <- Heatmap(as.matrix(kegg_subset_revlog), name="-log10 FDR", col=col_fun, 
  row_names_gp = gpar(fontsize = 5))
draw(heatmap, heatmap_legend_side = "left")

  1. (OPTIONAL) Replot with the breaks as the actual FDR corrected p value on the legend.
# We saw the breaks were at 0, 5, 10, and 15
breaks <- c(0, 5, 10, 15)

heatmap <- Heatmap(as.matrix(kegg_subset_revlog), 
                   col=col_fun, 
                   heatmap_legend_param = list(
                     at=breaks,
                     labels=10 ^ (-1 * breaks),
                     title="FDR"),
                   row_names_gp = gpar(fontsize = 5))
draw(heatmap, heatmap_legend_side = "left")

2.3 Heatmap of expression patterns in pathway (Take Home Exercise)

2.3.1 Read in data sets

NOTE: some of these are repeats of steps we ran before. Pick the top KEGG pathway from our DAVID results and make a heatmap of the expression levels for genes in that pathway.

Use the results from DAVID from before, and the corresponding RNA-seq data set.

  • Use the chart results from DAVID (david_chart.txt).
  • Use the normalized expression table (RNAseq_norm.txt).

For consistency we will download these from RIC github repository. However, you can run the same exercise with your own DAVID results as well.

# read in the results from DAVID and RNA-seq normalized data
david <- read.delim("https://wd.cri.uic.edu/pathway/david_chart.txt",
  quote="")
# sort by significance
david <- david[order(david$FDR),]
david.kegg <- david[david$Category=="KEGG_PATHWAY" & david$FDR < 0.01,]
# read in normalized expression
norm <- read.delim("https://wd.cri.uic.edu/pathway/RNAseq_norm.txt",
  row.names=1, colClasses=c(Gene.name="NULL"))

2.3.2 Make heatmap for top KEGG pathway

Subset normalized expression to the genes in the top KEGG pathway:

# name of the top pathway
top.kegg.name <- david.kegg[1,"Term"]
# remove the KEGG ID
top.kegg.name <- gsub("mmu[0-9]*:","",top.kegg.name)
# get the gene list for the top pathway
# first remove the white space
top.kegg.genes <- gsub(" ", "", david.kegg[1,"Genes"])
# use string split to split the list into a vector by commas
# use unlist to turn it back into a vector
top.kegg.genes <- unlist(strsplit(top.kegg.genes,","))
length(top.kegg.genes)
## [1] 94
# subset the norm table to these genes
top.kegg.norm <- norm[top.kegg.genes,]
dim(top.kegg.norm)
## [1] 94  8
# log-scale and z-score
top.kegg.norm <- log2(top.kegg.norm + 0.1)
top.kegg.norm <- t(scale(t(top.kegg.norm)))
# plot in a heatmap
library(ComplexHeatmap)
Heatmap(top.kegg.norm, name = top.kegg.name, show_row_names = FALSE)

2.4 Pathway analysis in variant calling data

This is a result from variant calling in a whole-exome variant calling data set (one sample, human). We’ll do a pathway analysis of mutated genes, using two different filtering strategies:

  • Damaging: Genes that contain non-synonymous variants with a predicted damaging effect (SIFT score).
  • Mutation burden: Genes that contain >3 non-synonymous variants.

NOTES:

SIFT (Sorting Interlant From Tolerant) is a model for predicting the effects of amino acid substitutions on protein function, based on sequence homology and known physical properties of amino acids. The SIFT model scores amino acid substitutions on a range from 0 to 1, where 0 is biggest effect, 1 is least effect. Damaging effects are predicted for scores < 0.05.

The filtering strategies outlined below are only a couple examples of how to filter variants. Other options include combining the strategies (damaging + mutation burden), filtering also based on the genotype call (heterozygous vs homozygous), using a different threshold for mutation burden, including other prediction models of damage to function (polyPhen, phyloP, etc.), and looking at known information about the variants in databases such as dbSNP, 1000 Genomes/ExAC/gnomAD, COSMIC (for cancer mutations), ClinVar, etc.

2.4.1 Filter variants in R (Take Home Exercise)

First, read in the variant table in R studio. It is available as a tab-delimited text file.

vars <- read.delim("https://wd.cri.uic.edu/pathway/example_variants.txt")
# look at vars in the R studio variable browser, or with head
head(vars)
# filtering strategy 1: damaging effects
# variants with SIFT prediction D for damaging, just the Gene column
filter1 <- as.character(vars[vars$SIFT_pred=="D","Gene"])
# some gene annotations have a comma-separated list of genes
# strsplit will split them, and unlist will give it all as a vector
filter1 <- unlist(strsplit(filter1,","))
filter1 <- unique(filter1)
head(filter1)
## [1] "SAMD11"  "PERM1"   "ATAD3B"  "SLC35E2" "CEP104"  "CA6"
length(filter1)
## [1] 806
# filtering strategy 2: mutation burden
# remove variants with synonymous or unknown change to translated sequence
filter2_all <- as.character(vars[vars$ExonicFunc!="synonymous_SNV" &
  vars$ExonicFunc!="unknown","Gene"])
# split by commas again
filter2_all <- unlist(strsplit(filter2_all,","))
# generate a count per gene using the table function
filter2_table <- table(filter2_all)
head(filter2_table)
## filter2_all
##    A1BG   A2ML1  A4GALT   A4GNT    AACS AADACL2 
##       1       2       1       1       1       1
# get the gene names from the table with counts bigger than 3
filter2 <- names(filter2_table)[filter2_table > 3]
head(filter2)
## [1] "ABCA13" "ACAN"   "ADGRF2" "ADGRV1" "AHNAK"  "AHNAK2"
length(filter2)
## [1] 245
# check the number of genes in common between the lists
length(intersect(filter1, filter2))
## [1] 122
# write both lists to tables
write.table(filter1,"filter_damaging.txt",col.names=F,row.names=F,quote=F)
write.table(filter2,"filter_mutation_load.txt",col.names=F,row.names=F,quote=F)

NOTE: these gene lists are available from our server as well:

2.4.2 Run pathway enrichment in DAVID for both lists

NOTE: the DAVID portion of this exercise may be left as homework, depending on workshop timing. The visualization parts of the exercise can be completed using DAVID results we have already generated.

Go to DAVID https://david.ncifcrf.gov, and follow similar steps as before:

  • Set the identifier as OFFICIAL_GENE_SYMBOL
  • Set as Gene List, and submit
    • It may take a little longer with uploading gene lists, as they will match to multiple species. DAVID will warn us about this, and we should select Homo Sapiens from the list.
  • NOTE: Open a second tab in your browser to upload a second gene list at the same time. Once uploaded, both will appear in the List Manager on DAVID.
  • After the upload, we’ll just look at the GOTERM_BP_FAT database. Click the “Chart” button next to this database to open the enrichment table.

In the chart view, note which dataset we’re looking at. Also modify the filters:

  • Change the filtering options to set count=1, EASE=1
    • This will maximize the number of terms in the list, so we have less missing data in the intersection later
    • We will filter for significance later in R
  • Right-click to download the chart report
  • Repeat with the other gene list
    • Confirm that the other list is active when you open the chart view for it
  • Save the files as:
    • GOBP_damaging.txt and GOBP_mutation_load.txt

2.4.3 Comparison visualization in R (Take Home Exercise)

Prepare a heatmap comparison of the two results in R Studio. For consistency, we will use the results stored on our server, but you can read your own results into R also.

# read in tables; remember quote="" helps with parsing
#   pathway names with quotes in their name
gobp.damaging <- read.delim(
  "https://wd.cri.uic.edu/pathway/GOBP_damaging.txt",
  quote="")
gobp.mutation_load <- read.delim(
  "https://wd.cri.uic.edu/pathway/GOBP_mutation_load.txt",
  quote="")
# process as we did before:
# subset the columns
sub.damaging <- gobp.damaging[,c("Term","FDR")]
sub.mutation_load <- gobp.mutation_load[,c("Term","FDR")]
# name the FDR column based on the gene list
colnames(sub.damaging)[2] <- "Damaging.Variants"
colnames(sub.mutation_load)[2] <- "Mutation.Load"
# merge and replace missing values with FDR = 1
gobp.merged <- merge(x=sub.damaging, y=sub.mutation_load, by="Term", all=T)
gobp.merged[is.na(gobp.merged)] <- 1
# prepare as a data frame with term IDs in the rownames
gobp.df <- data.frame(gobp.merged[,c(2:ncol(gobp.merged))])
rownames(gobp.df) <- gobp.merged[,1]
# subset to the significant terms and log-scale FDRs
gobp.subset <- gobp.df[apply(gobp.df, 1, min) < 0.05, ]
gobp.subset <- -log10(gobp.subset)
# remove the GO IDs from the term descriptions
rownames(gobp.subset) <- gsub("GO:[0-9]*~","",rownames(gobp.subset))
# plot in a heatmap, setting a color scale first
library(circlize)
col_fun <- colorRamp2(c(0, -log10(0.05), max(unlist(gobp.subset))),
  c("white","yellow","red"))
library(ComplexHeatmap)
heatmap <- Heatmap(as.matrix(gobp.subset), name="-log10 FDR", col=col_fun,
  row_names_gp = gpar(fontsize = 5))
draw(heatmap, heatmap_legend_side = "left")

2.5 Using STRING

2.5.2 Search by molecule of interest

  • Step 1: Click search button top right side of the menu.
  • Step 2: Select “Protein by name” in the left side menu
  • Step 3: Type “ER-beta” in the “Protein Name” search box
  • Step 4: Click “SEARCH” button

  • Step 1: Select the species of interest. For this exercise we will keep default selected choice “Homo sapiens”
  • Step 2: Click “continue” button

  • Review network and analysis details
    • Review Network Stats and Functional enrichment for the network.
    • Click on node/edge to get more details for a given interaction.


  • Click on “Legend”
    • Review edges colors for interaction type, e.g. pink = Experimental Evidence
    • Click on node/edge to get more details for a given interaction.

  • Changing the view settings:
    • For default network settings as shown in the above/below network. Please choose following options if your network looks different than the figure.
      • Step 1: Under “Active interaction sources:” Select all evidence
      • Step 2: Under “minimum required interaction score:” select “Medium confidence (0.400)”
      • Step 3: Under “max number of interactors to show:” select “1st shell - no more than 10 interactions”
      • Step 4: Click “UPDATE”” to update the network
    • Explore different settings add/remove extra nodes/edges (optional exercise)
      • Step 1: Under “Active interaction sources:” Select just “Experiments” as evidence
      • Step 2: Under “minimum required interaction score” select “Low confidence (0.150)”
      • Step 3: Under “max number of interactors to show:” 1st shell - no more than 10 interactions
        • Note: The 1st shell iteractors are the proteins directly associated with input protein(s)/query. 2nd shell of iteractors are the proteins associated with the proteins from the 1st shell or with input protein(s)/query.
      • Step 4: Click “UPDATE” to update the network


  • Export results: Click on Export tab
    • Check all download options
    • Check the interaction table.

2.5.3 Search by multiple proteins

Download antiviral gene list.

https://wd.cri.uic.edu/pathway/antiviral_list.txt

  • Step 1: Select “Multiple Protein” in the left side menu
  • Step 2: Upload the “antiviral_list.txt” by clicking “Browse”
  • Step 3: Organism: auto-detect
  • Step 4: Advanced Settings: Default
  • Step 5: Click “SEARCH” button

NOTE: If STRING asks you to select an organism, type in “Mus musculus”.

  • Step 1: Select proteins of interest. For this exercise we will keep all of the search results.
  • Step 2: Click “CONTINUE” button.

  • Adjustments to view settings (The network is shown using these settings):
    • Please choose following options if your network looks different than the figure.
      • Step 1: Under “Active interaction sources:” Select “Experiments” and “Neighborhood””
      • Step 2: Under “minimum required interaction score:” select “Medium confidence (0.400)”
      • Step 3: Under “max number of interactors to show:” select “- none/query proteins only -”
        • Note: The 1st shell interactors are the proteins directly associated with input protein(s)/query. 2nd shell of interactors are the proteins associated with the proteins from the 1st shell or with input protein(s)/query.
      • Step 4: Click “UPDATE”” to update the network
    • Observations.
      • There are small sub-connected networks and other are unconnected nodes (proteins)
      • Click on node/edge to get more details for a given interaction
      • Move around node(s) to get clear view

  • Additional information: Click on “Analysis” tab
    • Click on node/edge in the network to get more details for a given interaction
    • Observations: Network stats
      • There are 41 nodes and 18 edges.
  • Click on “Settings” again to update the network
    • Update the network by changing following settings.
      • Step 1: Under “Active interaction sources:” Select “Experiments” and “Neighborhood”
      • Step 2: Under “minimum required interaction score:” select “Medium confidence (0.400)”
      • Step 3: Under “max number of interactors to show:” select “no more than 5 interactions”
        • Note: The 1st shell iteractors are the proteins directly associated with input protein(s)/query. 2nd shell of iteractors are the proteins associated with the proteins from the 1st shell or with input protein(s)/query.
      • Step 4: Click “UPDATE”” to update the network
      • Move around nodes to see interactions. Move around single nodes
    • Observations
      • Tbk1: we see more nodes added around Tbk1

  • Stats: click on “Analysis” tab
    • Observations: Network stats
      • There are 47 nodes and 40 edges. Earlier we had 42 nodes and 26 edges.

2.6 Using STITCH

2.6.2 Search by molecule name

  • Estradiol (endogenous small molecule)
    • Step 1: Select “Item by name” in the left side menu
    • Step 2: Type “Estradiol” in the “Item Name” search box
    • Step 2A: Leave default “Organism: auto-detect”
    • Step 3: Click “SEARCH” button

  • For this exercise we will keep default selected choice “estradiol” in the chemical list.
  • Click CONTINUE button

  • Click on “Analysis”
    • Click on node/edge in the network to get more details for a given interaction.
    • Observations: Protein-Protein Network stats, Functional enrichment, etc.

2.6.3 Search by protein name

  • ER-beta (protein)
    • Select “Item by name” in the left side menu
    • Type “ER-beta” in the “Item Name” search box (leave default “Organism: auto-detect”)
    • Click “SEARCH” button

  • Keep default species as “Homo sapiens”.
  • Click “CONTINUE” button

  • Click on “Settings”
    • Review different settings for protein-chemical interaction

  • Click on “Analysis”
    • Click on node/edge in the network to get more details for a given interaction.
    • Observations: Protein-Protein Network stats, Functional enrichment, etc.

2.6.4 Search by drug name

  • Tamoxifen (drug)
    • Select “Item by name” in the left side menu
    • Type “Tamoxifen” in the “Item Name” search box (leave default “Organism: auto-detect”)
    • Click “SEARCH” button

  • Leave default species to “Homo sapiens”
  • Click “CONTINUE” button

  • Click on “Analysis”
    • Click on node/edge in the network to get more details for a given interaction.
    • Observations: Protein-Protein Network stats, Functional enrichment etc

2.7 Using miRNet

2.7.1 Prepare input data

Download the miRBase IDs

https://wd.cri.uic.edu/pathway/miR-ids.txt

2.7.3 Perform network analysis for 10 microRNAs

  • Step 1: Select “H. sapiens (human)” from “Organism”” dropdown
  • Step 2: Select “miRBase ID” from “ID type” dropdown
  • Step 3: Select “Exosomes” from “Tissue (humans only)” dropdown
  • Step 4: Select “Genes miRTarbase v9.0” and “lncRNAs” from “Targets” dropdown
  • Step 5: Copy paste following 10 miRBase IDs into miRNA list from mir-ids.txt
  • Step 6: Click Submit.
  • Step 7: Then click Proceed

2.7.4 Results overview

  • Step 1: Right click “Browse” for mir2gene to open results in new tab to review the table.
  • Step 2: Right click “Browse” for mir2lnc to open results in new tab to review the table.
  • Step 3: Click “Minimum Network” to get smaller network.
    • Observe number of edges in the Networks table (before clicking on minumum network)
    • There are 6196 edges in the network based on our query
    • Observe number of edges in the network table after “minimum network”
    • Number of edges reduced to 90
  • After filtering, click “Proceed”

2.7.5 Review Interaction Tables results: microRNA - Gene interactions

  • microRNA - Gene interactions (from “Browse” button)
    • Review Advanced filter for different options to filter based on Target, Method, Literature, etc.
    • Filtered results can be used further downstream to do network analysis

2.7.6 Review network

  • Review “minimum network”
  • Enrichment analysis options

2.8 INSTRUCTOR DEMONSTRATION: Network analysis using IPA

2.8.1 Build (Grow) a network from a single molecule

  1. Enter “ER-beta” in search dialog and click Search.

  1. Select “ESR2” and click Add To My Pathway and select New My Pathway.

  1. In the Network view click on the ESR2 node (rectangle) and click BUILD button.

  1. Select Grow as the Tool.
    1. In General Settings, set Grow out… …that are “Downstream of selected molecules”
    2. In Relationship Types, Click “Select all” to unselect all types and scroll down to select “transcription”
    3. In Node Types, Click “Select all” to unselect all types and scroll down to select “growth factor”
    4. Click APPLY

  1. Export molecules and relationships

  • Export All Molecules

  • Export All Relationships

2.8.2 Find connections between molecules

  1. Enter “ER-beta” in search dialog and click Search.
  2. Select “ESR2” and click Add To My Pathway and select New My Pathway.
  3. Enter “p53” in search dialog and click Search
  4. Select “TP53” and click Add To My Pathway and select bottom New My Pathway.

  1. In the Network view click BUILD button.
  2. Select Path Explorer as the Tool.
    1. Select “ESR2” in the view and click the ADD button for Set 1
    2. Select “TP53” in the view and click the ADD button for Set 2
    3. Set the Direction as “From Set A to Set B”
    4. In Species, click “Select all” to unselect all species and select “Human”
    5. In Node Types, click “Select all” to unselect all types and scroll down to select “transcription regulator”
    6. Click APPLY button.

  1. Click on the checkbox in the header row and click ADD TO MY PATHWAY

  1. Export molecules and relationships
  • Export All Molecules

  • Export All Relationships